Abstract


Two new information-theoretic methods are introduced for establishing Poisson approximation
inequalities. First, using only elementary information-theoretic techniques it is shown that,
when $S_n = \sum_{i=1}^n X_i$ is the sum of the (possibly dependent) binary random variables
$X_1, X_2, \ldots, X_n$, with $E(X_i) = p_i$ and $E(S_n) = \lambda$, then

$$D(P_{S_n}\|\mathrm{Po}(\lambda)) \le \sum_{i=1}^n p_i^2 + \Big[\sum_{i=1}^n H(X_i) - H(X_1, X_2, \ldots, X_n)\Big],$$

where $D(P_{S_n}\|\mathrm{Po}(\lambda))$ is the relative entropy between the distribution of $S_n$ and the
Poisson($\lambda$) distribution. The first term in this bound measures the individual smallness
of the $X_i$ and the second term measures their dependence. A general method is
outlined for obtaining corresponding bounds when approximating the distribution of a
sum of general discrete random variables by an infinitely divisible distribution.

Second, in the particular case when the $X_i$ are independent, the following sharper
bound is established,

$$D(P_{S_n}\|\mathrm{Po}(\lambda)) \le \frac{1}{\lambda}\sum_{i=1}^n \frac{p_i^3}{1-p_i},$$

and it is also generalized to the case when the $X_i$ are general integer-valued random
variables. Its proof is based on the derivation of a subadditivity property for a new discrete
version of the Fisher information, and uses a recent logarithmic Sobolev inequality for
the Poisson distribution.
1 Introduction


Let $X_1, X_2, \ldots, X_n$ be binary random variables. A classical result in probability states that,
if the $X_i$ are independent and identically distributed (i.i.d.) with common parameter $p_i =
E(X_i) = \lambda/n$, then, when $n$ is large, the distribution of their sum

$$S_n = X_1 + X_2 + \cdots + X_n$$

is close to Po($\lambda$), the Poisson distribution with parameter $\lambda$. More generally, analogous results
apply when the $X_i$ are possibly dependent and not necessarily identically distributed. The
distribution of $S_n$ is close to Po($\lambda$) as long as:

(a) The sum $\sum_i p_i$ of the parameters $p_i$ of the $X_i$ is close to $\lambda$.

(b) None of the $X_i$ dominate the sum, i.e., all the $p_i$ are small.

(c) The variables $X_i$ are not strongly dependent.

Such results are often referred to as “laws of small numbers” or “Poisson approximation 
results.” See mim Section 2.6] [S] for details. 

Our purpose here is to illustrate how techniques based on information-theoretic ideas can 
be used to establish general Poisson approximation inequalities. In Section 2 we prove:

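Proposition 1. Poisson Approximation in Relative Entropy. If $S_n = \sum_{i=1}^n X_i$ is the sum of $n$
(possibly dependent) binary random variables $X_1, X_2, \ldots, X_n$ with parameters $p_i = E(X_i)$
and with $E(S_n) = \sum_{i=1}^n p_i = \lambda$, then the distribution $P_{S_n}$ of $S_n$ satisfies

$$D(P_{S_n}\|\mathrm{Po}(\lambda)) \le \sum_{i=1}^n p_i^2 + \Big[\sum_{i=1}^n H(X_i) - H(X_1, X_2, \ldots, X_n)\Big]. \qquad (1)$$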


For two probability distributions $P$ and $Q$ on a discrete set $S$, the relative entropy
between $P$ and $Q$ is defined as $D(P\|Q) = \sum_{s \in S} P(s) \log\frac{P(s)}{Q(s)}$, and the entropy of a
discrete random variable (or random vector) $X$ with distribution $P$ on $S$ is $H(X) = H(P) =
-\sum_{s \in S} P(s) \log P(s)$, where log denotes the natural logarithm.

Whenever (a), (b) and (c) hold we expect the two terms in the right-hand side of (1)
to be small, and hence the distribution of $S_n$ to be close to Po($\lambda$) in the relative entropy
sense. Although $D(P\|Q)$ is not a proper metric, it is a natural measure of “dissimilarity”
in the context of statistics Ch. 12], and it can be used to define a topology on 

probability measures m- Also, bounds in relative entropy can be translated into bounds in 
total variation via Pinsker’s inequality HH 

$$\tfrac{1}{2}\|P - Q\|_{TV}^2 \le D(P\|Q). \qquad (2)$$

For example, if the $X_i$ are independent, (1) reduces to

$$D(P_{S_n}\|\mathrm{Po}(\lambda)) \le \sum_{i=1}^n p_i^2. \qquad (3)$$
Although this is reminiscent of the simple total-variation bound due to Le Cam m,

$$\|P_{S_n} - \mathrm{Po}(\lambda)\|_{TV} \le 2\sum_{i=1}^n p_i^2$$

(which, incidentally, only holds when the $X_i$ are independent), applying Pinsker’s inequality
(2) to (3) leads to the suboptimal bound

$$\|P_{S_n} - \mathrm{Po}(\lambda)\|_{TV} \le \Big[2\sum_{i=1}^n p_i^2\Big]^{1/2}. \qquad (4)$$
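The relationship between the relative-entropy bound, the Le Cam bound and the Pinsker-derived bound can be checked numerically. The following is a minimal sketch, assuming that $\|\cdot\|_{TV}$ denotes the $\ell_1$ distance (the normalisation under which Pinsker's inequality reads $\frac{1}{2}\|P-Q\|^2_{TV} \le D(P\|Q)$); the parameter values and all function names are illustrative choices, not taken from the paper.

```python
import math

def bernoulli_sum_pmf(ps):
    """Exact pmf of a sum of independent Bernoulli(p_i) variables, by convolution."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    return pmf

def poisson_pmf(lam, kmax):
    """Poisson(lam) probabilities for k = 0..kmax, built by the recurrence."""
    out = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        out.append(out[-1] * lam / k)
    return out

def relative_entropy(p, q):
    """D(p||q) in nats, summed over the support of p."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

ps = [0.05, 0.1, 0.02, 0.08, 0.15]     # illustrative (hypothetical) parameters
lam = sum(ps)
pmf = bernoulli_sum_pmf(ps)
po = poisson_pmf(lam, len(pmf) - 1)

d = relative_entropy(pmf, po)
# l1 distance: beyond n the sum has no mass, so the Poisson tail counts in full.
l1 = sum(abs(a - b) for a, b in zip(pmf, po)) + (1.0 - sum(po))
sum_sq = sum(p * p for p in ps)

print(f"D(P_Sn || Po) = {d:.6f}   sum p_i^2 = {sum_sq:.6f}")
print(f"l1 distance   = {l1:.6f}   Le Cam: {2 * sum_sq:.6f}   via Pinsker: {math.sqrt(2 * sum_sq):.6f}")
```

For small $p_i$ the Le Cam bound is of order $\sum p_i^2$ while the Pinsker route only gives order $(\sum p_i^2)^{1/2}$, which is the sense in which the latter is suboptimal.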


The proof of Proposition 1 uses only elementary information-theoretic facts that are 
established using little more than Jensen’s inequality. To get sharper bounds for the case 
of independent random variables $X_i$, in Section 3 we employ a new discrete version of the
Fisher information which we call scaled Fisher information, and we prove: 

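Theorem 1. Poisson Approximation for Independent Variables: If $S_n = \sum_{i=1}^n X_i$ is the sum
of $n$ independent binary random variables $X_1, X_2, \ldots, X_n$, with $E(S_n) = \sum_{i=1}^n p_i = \lambda$, then

$$D(P_{S_n}\|\mathrm{Po}(\lambda)) \le \frac{1}{\lambda}\sum_{i=1}^n \frac{p_i^3}{1-p_i}. \qquad (5)$$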
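To get a feel for how much sharper the bound for independent summands is, one can compare it with the bound of Proposition 1 in the i.i.d. Bernoulli($\lambda/n$) case. The sketch below computes the exact relative entropy between Binomial($n, \lambda/n$) and Po($\lambda$) and prints it next to both bounds; the choice $\lambda = 1$ and the helper names are assumptions made only for illustration.

```python
import math

def binomial_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def poisson_pmf(lam, kmax):
    out = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        out.append(out[-1] * lam / k)
    return out

def relative_entropy(p, q):
    """D(p||q) in nats, summed over the support of p."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

lam = 1.0
rows = []
for n in (5, 10, 50, 100):
    p = lam / n
    d = relative_entropy(binomial_pmf(n, p), poisson_pmf(lam, n))
    prop1 = n * p**2                    # sum of p_i^2 (independent case of Proposition 1)
    thm1 = n * p**3 / (1 - p) / lam     # (1/lam) * sum of p_i^3 / (1 - p_i)
    rows.append((n, d, thm1, prop1))
    print(f"n={n:4d}  D={d:.3e}  Theorem 1 bound={thm1:.3e}  Proposition 1 bound={prop1:.3e}")
```

In this family the first bound decays like $1/n$ and the second like $1/n^2$.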


The proof of Theorem 1 combines a natural discrete analog of Stam’s subadditivity of the
Fisher information mm, and a recent logarithmic Sobolev inequality of Bobkov and Ledoux
[S] . As we discuss extensively in Section 3, Theorem 1 is a significant improvement over
Proposition 1, and in certain cases it leads to total variation bounds that are asymptotically
optimal up to multiplicative constants in the convergence rate. Moreover, (5) is a nontrivial
improvement over existing results, as it gives a bound for the relative entropy and not just 
the total variation distance. 

For an information-theoretic interpretation, consider a triangular array of binary random
variables $\{(X_1^{(n)}, X_2^{(n)}, \ldots, X_n^{(n)}),\ n \ge 1\}$, such that the right-hand side of (1) goes to zero as
$n \to \infty$ (as, for example, when the $X_i^{(n)}$ are i.i.d. Bernoulli($\lambda/n$)). Then the distribution of
$S_n$ converges to Po($\lambda$), i.e., $P_{S_n}$ comes closer and closer to the “most random” distribution
among all those that can be obtained by summing a finite number of Bernoulli random
variables: Let $\mathcal{P}(\lambda)$ denote the set of all distributions of sums $S_n$ of $n$ independent binary
random variables with $E(S_n) = \lambda$, for any finite $n$. Then m,

$$H(\mathrm{Po}(\lambda)) = \sup\{H(P) : P \in \mathcal{P}(\lambda)\}.$$

So, roughly and somewhat incorrectly speaking, the entropy of $S_n$ “increases” to the maximum
entropy $H(\mathrm{Po}(\lambda))$ as $n$ grows. This invites a tempting analogy with the second law of
thermodynamics, stating that the uncertainty of a physical system increases with time, until
the system reaches equilibrium in its maximum entropy state.
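The maximum-entropy picture can be illustrated in the simplest i.i.d. case. The sketch below evaluates the entropy of Binomial($n, \lambda/n$) for a few values of $n$ and compares it with the entropy of Po($\lambda$); it only illustrates this one family (with the arbitrary choice $\lambda = 2$), and the helper names are hypothetical.

```python
import math

def binomial_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def poisson_pmf(lam, kmax):
    out = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        out.append(out[-1] * lam / k)
    return out

def entropy(pmf):
    """Shannon entropy in nats."""
    return -sum(a * math.log(a) for a in pmf if a > 0)

lam = 2.0
h_poisson = entropy(poisson_pmf(lam, 100))   # truncation error is negligible here
sizes = (3, 5, 10, 50, 100)
h_binomial = [entropy(binomial_pmf(n, lam / n)) for n in sizes]

for n, h in zip(sizes, h_binomial):
    print(f"n={n:4d}  H(Bin(n, lam/n)) = {h:.5f}")
print(f"         H(Po(lam))       = {h_poisson:.5f}")
```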

Corresponding information-theoretic interpretations and proofs have been given for numerous
classical results of probability theory, including the central limit theorem mumm,
the convergence of Markov chains j^-ilj|24j[F)], many large deviations results |12j|16|[Tl^, the
martingale convergence theorem iniini, and the Hewitt-Savage 0-1 law m. See also the
powerful comments in na pp. 211,215]. Finally, we mention that Johnstone and MacGibbon
considered the problem of Poisson convergence from the information theory angle in m.
Their approach is different from ours, and parallels that in mm for the central limit theorem.

2 General Bounds in Relative Entropy 

Before giving the proof of Proposition 1 we introduce some notation and briefly recall
two elementary, well-known facts. The first one formalizes the intuitive idea that we cannot
do better in a hypothesis test by simply pre-processing the data. Suppose $X$ and $Y$ are
random variables with distributions $P$ and $Q$, respectively, let $f$ be an arbitrary function,
and write $P', Q'$ for the distributions of $f(X)$ and $f(Y)$, respectively. The following “data
processing” inequality is an easy consequence of Jensen’s inequality na Lemma 1.3.11],

$$D(P'\|Q') \le D(P\|Q).$$

Next, given $X$ and $Y$ with joint distribution $P_{X,Y}$ and marginals $P_X$ and $P_Y$, let $I(X;Y) =
H(X) - H(X|Y)$ denote their mutual information. The “chain rule” is the simple expansion,

$$D(P_{X,Y}\|Q_X \times Q_Y) = D(P_X\|Q_X) + D(P_Y\|Q_Y) + I(X;Y),$$

for any two probability distributions $Q_X$ and $Q_Y$.
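Both facts are easy to confirm on a toy example. The following sketch checks the chain rule and the data processing inequality (with $f(x, y) = x + y$) for one arbitrary joint distribution on $\{0,1\}^2$; the numbers and names are illustrative assumptions.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) in nats for two aligned lists of probabilities."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# An arbitrary (hypothetical) joint pmf of (X, Y) on {0,1}^2 and a product reference Q_X x Q_Y.
joint = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.2}
qx, qy = [0.8, 0.2], [0.5, 0.5]

keys = sorted(joint)
p_joint = [joint[k] for k in keys]
q_joint = [qx[x] * qy[y] for x, y in keys]
px = [sum(v for (x, _), v in joint.items() if x == i) for i in (0, 1)]
py = [sum(v for (_, y), v in joint.items() if y == j) for j in (0, 1)]

# Chain rule: D(P_XY || Q_X x Q_Y) = D(P_X||Q_X) + D(P_Y||Q_Y) + I(X;Y).
lhs = kl(p_joint, q_joint)
mutual_info = kl(p_joint, [px[x] * py[y] for x, y in keys])
rhs = kl(px, qx) + kl(py, qy) + mutual_info

def push_forward(weights):
    """Distribution of f(X, Y) = X + Y under the given weights on {0,1}^2."""
    out = [0.0, 0.0, 0.0]
    for (x, y), w in zip(keys, weights):
        out[x + y] += w
    return out

# Data processing: the divergence cannot increase after applying f.
d_after = kl(push_forward(p_joint), push_forward(q_joint))
print(f"chain rule: {lhs:.6f} vs {rhs:.6f};  after f: {d_after:.6f} <= {lhs:.6f}")
```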

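Proof of Proposition 1. If we define $S'_n = \sum_{i=1}^n Z_i$, where $Z_i$ are independent Poisson($p_i$)
random variables, then the distribution $P_{S'_n}$ of $S'_n$ is Po($\lambda$) and

$$\begin{aligned}
D(P_{S_n}\|\mathrm{Po}(\lambda)) &= D(P_{S_n}\|P_{S'_n}) \\
&\overset{(a)}{\le} D(P_{X_1,\ldots,X_n}\|P_{Z_1,\ldots,Z_n}) \\
&\overset{(b)}{=} \sum_{i=1}^n D(P_{X_i}\|\mathrm{Po}(p_i)) + \sum_{i=1}^{n-1} I(X_i; (X_{i+1},\ldots,X_n)), \qquad (6)
\end{aligned}$$

where (a) follows from the data processing inequality, and (b) follows by applying the chain
rule $(n-1)$ times. Using simple calculus we obtain the bound

$$D(\mathrm{Bern}(p)\|\mathrm{Po}(p)) = (1-p)\log\frac{1-p}{e^{-p}} + p\log\frac{p}{p e^{-p}} \le p^2,$$

which, applied to each term in the first sum in (6) gives,

$$\begin{aligned}
D(P_{S_n}\|\mathrm{Po}(\lambda)) &\le \sum_{i=1}^n p_i^2 + \sum_{i=1}^{n-1} I(X_i; (X_{i+1},\ldots,X_n)) \qquad (7) \\
&= \sum_{i=1}^n p_i^2 + \Big[\sum_{i=1}^n H(X_i) - H(X_1,\ldots,X_n)\Big],
\end{aligned}$$

where in the last step we expanded the definition of the mutual informations. □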
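The “simple calculus” step in the proof can be spelled out as follows. This is a sketch of one possible route, using only the elementary inequality $\log(1-p) \le -p$ for $0 \le p < 1$; it is not necessarily the computation the authors had in mind.

```latex
\begin{align*}
D(\mathrm{Bern}(p)\,\|\,\mathrm{Po}(p))
  &= (1-p)\log\frac{1-p}{e^{-p}} + p\log\frac{p}{p\,e^{-p}} \\
  &= (1-p)\log(1-p) + p(1-p) + p^2 \\
  &\le -p(1-p) + p(1-p) + p^2 \;=\; p^2 .
\end{align*}
```

The second line uses $\log(1/e^{-p}) = p$ in both terms, and the third line bounds $(1-p)\log(1-p)$ by $-p(1-p)$.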
The first term in the above bound makes precise what we mean by the requirement that
“all the $p_i$ be small” whereas the second term quantifies their degree of dependence. It is
worth noting that this difference between the sum of the entropies of the $X_i$ and their joint
entropy can also be written as the relative entropy $D(P_{X_1,\ldots,X_n}\|P_{X_1} \times \cdots \times P_{X_n})$ between their
joint distribution and the product of their marginals. This expression also admits a natural
interpretation as a measure of how far the $X_i$ are from being independent.

As indicated in the introduction, although the result of Proposition 1 is generally good 
enough to prove convergence to the Poisson distribution, for finite n it often gives a suboptimal 
convergence rate. This is also illustrated in the following two examples. 


A Markov Chain. Let $\{(X_1^{(n)}, X_2^{(n)}, \ldots, X_n^{(n)}),\ n \ge 1\}$ be a triangular array of binary random
variables such that each row $(X_1^{(n)}, X_2^{(n)}, \ldots, X_n^{(n)})$ is a Markov chain with transition matrix

$$\begin{pmatrix} \frac{n}{n+1} & \frac{1}{n+1} \\[4pt] \frac{n-1}{n+1} & \frac{2}{n+1} \end{pmatrix}$$

and with each $X_i^{(n)}$ having (the stationary) Bernoulli($1/n$) distribution. The convergence of the
distribution of $S_n = \sum_{i=1}^n X_i^{(n)}$ to Po(1) is a well-studied problem; see and the references
therein. Applying Proposition 1 (or, equivalently, inequality (7)) in this case translates to

$$D(P_{S_n}\|\mathrm{Po}(1)) \le \sum_{i=1}^n \frac{1}{n^2} + \sum_{i=1}^{n-1} I(X_i^{(n)}; (X_{i+1}^{(n)},\ldots,X_n^{(n)})) = \frac{1}{n} + (n-1)\, I(X_1^{(n)}; X_2^{(n)}),$$

since $I(X_i^{(n)}; (X_{i+1}^{(n)},\ldots,X_n^{(n)})) = I(X_i^{(n)}; X_{i+1}^{(n)})$ by the Markov property, and stationarity
implies that $I(X_i^{(n)}; X_{i+1}^{(n)}) = I(X_1^{(n)}; X_2^{(n)})$. A straightforward calculation yields that

$$(n-1)\, I(X_1^{(n)}; X_2^{(n)}) = (n-1)\Big[h\big(\tfrac{1}{n}\big) - h\big(\tfrac{1}{n+1}\big)\Big] + \frac{n-1}{n}\, h\big(\tfrac{1}{n+1}\big) - \frac{n-1}{n}\, h\big(\tfrac{2}{n+1}\big),$$

where $h(p)$ denotes the binary entropy function $h(p) = -p\log p - (1-p)\log(1-p)$, and
simple calculus shows that all three terms above converge to zero as $n \to \infty$. In fact, this
expression can be bounded above by

$$h\big(\tfrac{1}{n+1}\big) + \frac{\log n}{n} \le 3\,\frac{\log n}{n},$$

where the last inequality holds for all $n \ge 3$, so putting it all together,

$$D(P_{S_n}\|\mathrm{Po}(1)) \le 3\,\frac{\log n}{n} + \frac{1}{n}.$$

[A corresponding bound can similarly be derived if instead of stationarity we assume that
$X_1^{(n)}$ has $p_1^{(n)} = E(X_1^{(n)}) \le 1/n$.] As mentioned above, although this bound is sufficient to
prove that $P_{S_n}$ converges to the Poisson distribution, it leads (via Pinsker’s inequality (2)) to a
convergence rate in total variation of order $\sqrt{(\log n)/n}$, compared to the $O(1/n)$ bound derived in 0132] [all.
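The bounds in this example can be checked against the exact distribution of $S_n$. The sketch below assumes the chain described above (stationary probability $1/n$ of a one, transition probabilities $1/(n+1)$ and $2/(n+1)$ into state one from states zero and one respectively), computes the pmf of $S_n$ by dynamic programming, and compares $D(P_{S_n}\|\mathrm{Po}(1))$ with $\frac{1}{n} + (n-1)I(X_1;X_2)$ and with $3\frac{\log n}{n} + \frac{1}{n}$. All function names are illustrative.

```python
import math

def h(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def dependence_term(n):
    """(n-1) * I(X_1; X_2), using I(X_1;X_2) = H(X_2) - H(X_2 | X_1)."""
    i12 = h(1 / n) - (1 - 1 / n) * h(1 / (n + 1)) - (1 / n) * h(2 / (n + 1))
    return (n - 1) * i12

def chain_sum_pmf(n):
    """Exact pmf of S_n, tracking (partial sum, last symbol) along the chain."""
    to_one = {0: 1 / (n + 1), 1: 2 / (n + 1)}      # P(next = 1 | current)
    state = {0: [1 - 1 / n] + [0.0] * n, 1: [0.0, 1 / n] + [0.0] * (n - 1)}
    for _ in range(n - 1):
        nxt = {0: [0.0] * (n + 1), 1: [0.0] * (n + 1)}
        for last in (0, 1):
            for k, mass in enumerate(state[last]):
                if mass == 0.0:
                    continue
                nxt[0][k] += mass * (1 - to_one[last])
                nxt[1][k + 1] += mass * to_one[last]
        state = nxt
    return [a + b for a, b in zip(state[0], state[1])]

def poisson_pmf(lam, kmax):
    out = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        out.append(out[-1] * lam / k)
    return out

results = []
for n in (3, 10, 50, 100):
    pmf = chain_sum_pmf(n)
    po = poisson_pmf(1.0, n)
    d = sum(a * math.log(a / b) for a, b in zip(pmf, po) if a > 0)
    exact_bound = 1 / n + dependence_term(n)
    closed_bound = 3 * math.log(n) / n + 1 / n
    results.append((n, d, exact_bound, closed_bound))
    print(f"n={n:4d}  D={d:.4e}  1/n+(n-1)I={exact_bound:.4e}  3log(n)/n+1/n={closed_bound:.4e}")
```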
A Compound Poisson Approximation Example. Let $X_1, X_2, \ldots, X_n$ be independent Bernoulli
random variables with parameters $p_i = E(X_i)$, write $\lambda = \sum_{i=1}^n p_i$, and let $\alpha_1, \alpha_2, \ldots, \alpha_n$ be
i.i.d., independent of the $X_i$, with distribution

$$\alpha_i = \begin{cases} 1 & \text{with prob } 1/2 \\ 2 & \text{with prob } 1/2. \end{cases}$$

We will show that the distribution of the sum

$$S_n = \sum_{i=1}^n \alpha_i X_i$$

is close to the compound Poisson distribution with parameters $(\lambda/2, \lambda/2)$, which we denote
by Po($\lambda/2, \lambda/2$). Recall that if $Z_1$ and $Z_2$ are i.i.d. Poisson($\lambda/2$) random variables, then
$Z = Z_1 + 2Z_2$ has Po($\lambda/2, \lambda/2$) distribution. Alternatively, we can write $Z = \sum_{i=1}^n Y_i$
where the $Y_i$ are independent Po($p_i/2, p_i/2$) random variables. Arguing as before, the data
processing inequality and the chain rule imply that

$$D(P_{S_n}\|\mathrm{Po}(\lambda/2, \lambda/2)) \le \sum_{i=1}^n D(P_{\alpha_i X_i}\|P_{Y_i}),$$

and it is straightforward to calculate

$$D(P_{\alpha_i X_i}\|P_{Y_i}) = p_i^2 + (1-p_i)\big[p_i + \log(1-p_i)\big] - \frac{p_i}{2}\log(1 + p_i/4) \le p_i^2,$$

so that

$$D(P_{S_n}\|\mathrm{Po}(\lambda/2, \lambda/2)) \le \sum_{i=1}^n p_i^2.$$
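The single-term calculation in this example can be verified directly. The sketch below assumes that $\alpha X$ takes the values 0, 1, 2 with probabilities $1-p$, $p/2$, $p/2$, and that $Y = Z_1 + 2Z_2$ with $Z_1, Z_2$ i.i.d. Poisson($p/2$), so that only $P(Y=0)$, $P(Y=1)$ and $P(Y=2)$ are needed; the closed form it is compared with is $p^2 + (1-p)[p + \log(1-p)] - \frac{p}{2}\log(1+p/4)$. Names and test values are illustrative.

```python
import math

def single_term_divergence(p):
    """D(P_{alpha X} || P_Y), computed from the three relevant probabilities."""
    p_ax = [1 - p, p / 2, p / 2]
    e = math.exp(-p)
    # Y = Z1 + 2*Z2:  P(Y=0) = e^{-p},  P(Y=1) = (p/2) e^{-p},
    # P(Y=2) = P(Z1=2, Z2=0) + P(Z1=0, Z2=1) = (p/2)(1 + p/4) e^{-p}.
    p_y = [e, (p / 2) * e, (p / 2) * (1 + p / 4) * e]
    return sum(a * math.log(a / b) for a, b in zip(p_ax, p_y) if a > 0)

def closed_form(p):
    return p**2 + (1 - p) * (p + math.log(1 - p)) - (p / 2) * math.log(1 + p / 4)

checks = []
for p in (0.01, 0.1, 0.3, 0.5):
    num, formula = single_term_divergence(p), closed_form(p)
    checks.append((p, num, formula))
    print(f"p={p:.2f}  numeric={num:.6e}  closed form={formula:.6e}  p^2={p * p:.6e}")
```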

A general method. Finally, we outline a simple general strategy for approximating the
distribution $P_{S_n}$ of the sum of $n$ nonnegative-integer-valued random variables $X_1, X_2, \ldots, X_n$
by the distribution of some infinitely divisible discrete random variable $Z$ with $E(S_n) = E(Z)$.

First, use the infinite divisibility of $P_Z$ to represent $Z$ as $Z = \sum_{i=1}^n Y_i$ where the $Y_i$ are
independent and have the same distribution as $Z$ but with different parameters. Then apply
the data processing inequality and the chain rule as before to obtain

$$D(P_{S_n}\|P_Z) \le \sum_{i=1}^n D(P_{X_i}\|P_{Y_i}) + \Big[\sum_{i=1}^n H(X_i) - H(X_1,\ldots,X_n)\Big],$$

and finally, estimate the two terms on the right-hand side of the above inequality. The first term
should be small if the $X_i$ are individually small and well-approximated by the corresponding $Y_i$,
and the second term should be small if the $X_i$ are sufficiently weakly dependent.
3 Tighter Bounds for Independent Random Variables


Next we take a different point of view that yields tighter bounds than Proposition 1. Recall 
that in \Z2\ |3U|pn]. the Fisher information of a random variable X with distribution P on 
$\mathbb{Z}_+ = \{0, 1, 2, \ldots\}$, is defined in a way analogous to that for continuous random variables, via

$$J(X) = E\bigg[\Big(\frac{P(X-1) - P(X)}{P(X)}\Big)^2\bigg],$$

with the convention that $P(-1) = 0$. However, as Kagan acknowledges, this definition
is really only useful if $X$ is supported on the entire $\mathbb{Z}_+$. If $X$ has bounded support then for
some $n$, $P(n) > 0$ but $P(n+1) = 0$, which implies that $J(X) = \infty$.

Partly in order to avoid this difficulty, we proceed along a different route. Recalling that
the Poisson distribution is characterized by the recurrence $\lambda P(x) = (x+1)P(x+1)$ for all
$x$, we let the scaled score function of a random variable $X$ with mean $\lambda$ and distribution $P$
on $\mathbb{Z}_+$ be

$$\rho_X(x) = \frac{(x+1)P(x+1)}{\lambda P(x)} - 1, \qquad x \in \mathbb{Z}_+,$$

and we define the scaled Fisher information of $X$ as

$$K(X) = \lambda E[\rho_X(X)^2].$$


From this we easily see that

$$K(X) \ge 0$$

with equality iff $\rho_X(X) = 0$ with probability 1, i.e., iff $X$ has a Poisson($\lambda$) distribution.
Moreover, as we show next, the smaller the value of $K(X)$, the closer $P$ is to the Poisson($\lambda$)
distribution. The proof of Proposition 2, given in Section 3.2, is an easy consequence of a
recent logarithmic Sobolev inequality of Bobkov and Ledoux [S].

Proposition 2. Relative Entropy and $K(X)$: If $X$ is a random variable with distribution $P$
on $\mathbb{Z}_+$ and with $E(X) = \lambda$, then

$$D(P\|\mathrm{Po}(\lambda)) \le K(X), \qquad (8)$$

as long as either $P$ has full support (i.e., $P(k) > 0$ for all $k$), or finite support (i.e., there
exists $N \in \mathbb{Z}_+$ such that $P(k) = 0$ for all $k > N$).

Note that from (8) and Pinsker’s inequality (2) we have that

$$\|P - \mathrm{Po}(\lambda)\|_{TV} \le \sqrt{2K(X)}. \qquad (9)$$

We also give a direct proof of (9) in Section 3.2, based on a simple Poincaré inequality for
the Poisson measure.
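The scaled Fisher information is straightforward to compute for a distribution given as a finite list of probabilities. The following is a minimal sketch, assuming $\rho_X(x) = \frac{(x+1)P(x+1)}{\lambda P(x)} - 1$ and $K(X) = \lambda E[\rho_X(X)^2]$, with the sum taken over the support of $P$; it checks that $K$ vanishes (up to rounding and truncation) for a Poisson pmf, that a Bernoulli($p$) variable has $K = p^2/(1-p)$, and that $D \le K$ for one binomial example. The function names are hypothetical.

```python
import math

def scaled_fisher(pmf):
    """K(X) = lam * E[rho_X(X)^2] for a pmf on {0, ..., len(pmf)-1}."""
    lam = sum(k * a for k, a in enumerate(pmf))
    total = 0.0
    for x, px in enumerate(pmf):
        if px == 0.0:
            continue                       # points outside the support carry no mass
        nxt = pmf[x + 1] if x + 1 < len(pmf) else 0.0
        rho = (x + 1) * nxt / (lam * px) - 1.0
        total += px * rho * rho
    return lam * total

def poisson_pmf(lam, kmax):
    out = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        out.append(out[-1] * lam / k)
    return out

def binomial_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

k_poisson = scaled_fisher(poisson_pmf(1.5, 60))     # truncated Poisson, K should be ~0
k_bernoulli = scaled_fisher([0.8, 0.2])             # Bernoulli(0.2): p^2/(1-p) = 0.05

n, p = 10, 0.15
pmf = binomial_pmf(n, p)
po = poisson_pmf(n * p, n)
d_binomial = sum(a * math.log(a / b) for a, b in zip(pmf, po) if a > 0)
k_binomial = scaled_fisher(pmf)

print(f"K(Poisson) ~ {k_poisson:.2e},  K(Bern(0.2)) = {k_bernoulli:.4f}")
print(f"Binomial(10, 0.15): D = {d_binomial:.5f} <= K = {k_binomial:.5f}")
```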
3.1 Results


The main step in the proof of Theorem 1 will be to establish a form of subadditivity for the 
scaled Fisher information. It is worth noting that in the Gaussian case the Fisher information 
is also subadditive mm, but, in contrast to the present setting, subadditivity alone does 
not suffice to prove the central limit theorem [^. Proposition 3 is proved in Section 3.2. 

Proposition 3. Subadditivity of Scaled Fisher Information: If $S_n = \sum_{i=1}^n X_i$ is the sum of
$n$ independent integer-valued random variables $X_1, X_2, \ldots, X_n$, with means $E(X_i) = p_i$ and
$E(S_n) = \sum_{i=1}^n p_i = \lambda$, then

$$K(S_n) \le \sum_{i=1}^n \frac{p_i}{\lambda} K(X_i).$$


Proof of Theorem 1. If the $X_i$ are independent Bernoulli($p_i$) random variables with
$\sum_{i=1}^n p_i = \lambda$, then $K(X_i) = p_i^2/(1-p_i)$ and Proposition 3 gives

$$K(S_n) \le \frac{1}{\lambda}\sum_{i=1}^n \frac{p_i^3}{1-p_i}.$$

Combining this with $X = S_n$ in Proposition 2 yields inequality (5). □
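The two steps of this proof can be traced numerically for a sum of independent, non-identically distributed Bernoulli variables. The sketch below computes the exact pmf of $S_n$, its scaled Fisher information, and the weighted sum $\sum_i (p_i/\lambda) K(X_i) = \frac{1}{\lambda}\sum_i p_i^3/(1-p_i)$, and checks the chain $D \le K(S_n) \le$ weighted sum. Parameter values and names are illustrative assumptions.

```python
import math

def bernoulli_sum_pmf(ps):
    """Exact pmf of a sum of independent Bernoulli(p_i) variables."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    return pmf

def scaled_fisher(pmf):
    """K(X) = lam * E[rho_X(X)^2], summed over the support of the pmf."""
    lam = sum(k * a for k, a in enumerate(pmf))
    total = 0.0
    for x, px in enumerate(pmf):
        if px == 0.0:
            continue
        nxt = pmf[x + 1] if x + 1 < len(pmf) else 0.0
        total += px * ((x + 1) * nxt / (lam * px) - 1.0) ** 2
    return lam * total

def poisson_pmf(lam, kmax):
    out = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        out.append(out[-1] * lam / k)
    return out

ps = [0.05, 0.1, 0.02, 0.08, 0.15]          # hypothetical parameters
lam = sum(ps)
pmf = bernoulli_sum_pmf(ps)

k_sum = scaled_fisher(pmf)
weighted = sum((p / lam) * scaled_fisher([1 - p, p]) for p in ps)
closed = sum(p**3 / (1 - p) for p in ps) / lam
po = poisson_pmf(lam, len(pmf) - 1)
d = sum(a * math.log(a / b) for a, b in zip(pmf, po) if a > 0)

print(f"D = {d:.6f} <= K(S_n) = {k_sum:.6f} <= sum (p_i/lam) K(X_i) = {weighted:.6f}")
```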


Example 1. If the $X_i$ are i.i.d. Bernoulli($\lambda/n$) random variables, from Theorem 1 combined
with Pinsker’s inequality (2) we obtain that for any $\epsilon > 0$,

$$\|P_{S_n} - \mathrm{Po}(\lambda)\|_{TV} \le (\sqrt{2} + \epsilon)\frac{\lambda}{n}, \qquad \text{for } n > \lambda/\epsilon.$$

This is a definite improvement over the earlier $\sqrt{2}\lambda/\sqrt{n}$ bound from (4), and, except for the
constant factor, it is asymptotically of the right order; see mm for details.

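Example 2. If the $X_i$ are i.i.d. Bernoulli($\mu/\sqrt{n}$) random variables, Theorem 1 together with
Pinsker’s inequality (2) yield

$$\|P_{S_n} - \mathrm{Po}(\mu\sqrt{n})\|_{TV} \le \frac{\mu}{\sqrt{n}}\sqrt{\frac{2}{1 - \mu/\sqrt{n}}},$$

which is of the same order as the optimal asymptotic rate, as $n \to \infty$,

$$\|P_{S_n} - \mathrm{Po}(\mu\sqrt{n})\| \sim \frac{\mu}{\sqrt{n}}\sqrt{1/(2\pi e)},$$

derived in m.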

Example 3. If the $X_i$ are Geometric random variables with respective distributions $P_i(x) =
(1-q_i)^x q_i$, $x \ge 0$, then $K(X_i) = (1-q_i)^2/q_i$. Letting $S_n = \sum_{i=1}^n X_i$ and assuming that
$E(S_n) = \sum_{i=1}^n (1-q_i)/q_i = \lambda$, combining Proposition 3 and the bound (9) yields

$$\|P_{S_n} - \mathrm{Po}(\lambda)\|_{TV} \le \sqrt{\frac{2}{\lambda}\sum_{i=1}^n \frac{(1-q_i)^3}{q_i^2}}.$$

In particular, taking all the $q_i = n/(n+\lambda)$ gives the elegant estimate

$$\|P_{S_n} - \mathrm{Po}(\lambda)\|_{TV} \le \frac{\sqrt{2}\,\lambda}{\sqrt{n(n+\lambda)}} \le \frac{\sqrt{2}\,\lambda}{n}.$$
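The quantities in Example 3 can be reproduced numerically. The sketch below assumes the geometric pmf $P(x) = (1-q)^x q$ and uses the fact that a sum of $n$ i.i.d. such variables is negative binomial; it checks $K = (1-q)^2/q$ for a single geometric variable and, for $q = n/(n+\lambda)$, compares $K(S_n)$ with $\lambda^2/(n(n+\lambda))$ and the $\ell_1$ distance to Po($\lambda$) with $\sqrt{2}\lambda/\sqrt{n(n+\lambda)}$. Truncation points, parameter values and names are illustrative assumptions.

```python
import math

def scaled_fisher(pmf):
    """K(X) = lam * E[rho_X(X)^2], summed over the support of the pmf."""
    lam = sum(k * a for k, a in enumerate(pmf))
    total = 0.0
    for x, px in enumerate(pmf):
        if px == 0.0:
            continue
        nxt = pmf[x + 1] if x + 1 < len(pmf) else 0.0
        total += px * ((x + 1) * nxt / (lam * px) - 1.0) ** 2
    return lam * total

def poisson_pmf(lam, kmax):
    out = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        out.append(out[-1] * lam / k)
    return out

def geometric_pmf(q, kmax):
    return [(1 - q) ** x * q for x in range(kmax + 1)]

def negative_binomial_pmf(n, q, kmax):
    """pmf of a sum of n i.i.d. geometric variables with parameter q."""
    return [math.comb(n + k - 1, k) * q**n * (1 - q) ** k for k in range(kmax + 1)]

q = 0.6
k_geometric = scaled_fisher(geometric_pmf(q, 200))      # tail beyond 200 is negligible

n, lam = 10, 1.0
nb = negative_binomial_pmf(n, n / (n + lam), 80)
po = poisson_pmf(lam, 80)
k_sum = scaled_fisher(nb)
l1 = sum(abs(a - b) for a, b in zip(nb, po))
tv_bound = math.sqrt(2) * lam / math.sqrt(n * (n + lam))

print(f"K(geometric) = {k_geometric:.6f}  vs (1-q)^2/q = {(1 - q) ** 2 / q:.6f}")
print(f"K(S_n) = {k_sum:.6f}  vs lam^2/(n(n+lam)) = {lam**2 / (n * (n + lam)):.6f}")
print(f"l1 distance = {l1:.5f} <= {tv_bound:.5f}")
```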


To see how tight the result of Proposition 3 is in general, note that the following lower
bound of Cramér-Rao type holds: Since for all $a$ and any random variable $S$ with mean $\lambda$
and variance $\sigma^2$,

$$0 \le \lambda E\big[\rho_S(S) - a(S - \lambda)\big]^2 = K(S) + \lambda\Big(a^2\sigma^2 - 2a\,\frac{\sigma^2 - \lambda}{\lambda}\Big),$$

choosing $a = (\sigma^2 - \lambda)/(\sigma^2\lambda)$, we obtain that

$$K(S) \ge \frac{(\sigma^2 - \lambda)^2}{\sigma^2\lambda}. \qquad (10)$$

In Example 1 where $S = S_n = \sum_{i=1}^n X_i$ is the sum of $n$ i.i.d. Bernoulli($\lambda/n$) random
variables, the lower bound (10) coincides with the upper bound given in Proposition 3.
Similarly, in Example 3 with all the $q_i = n/(n+\lambda)$, the upper bound from Proposition 3
holds with equality. Therefore, any remaining slackness in our bounds comes from either
Proposition 2 or Pinsker’s inequality.

Finally, in Proposition 4 below we establish a formal connection between relative entropy
and the probability distribution $(x+1)P(x+1)/\lambda$ implicitly used in our definition of the
scaled Fisher information. It is proved in the next section.

Proposition 4. Let $X$ be an integer-valued random variable with distribution $P$ and mean $\lambda$.
If $X$ is the sum of independent Bernoulli random variables, then

$$D(P\|\mathrm{Po}(\lambda)) = \int_0^\infty D(P_t\|\tilde{P}_t)\,dt, \qquad (11)$$

where $P_t(r) = \Pr(X_t = r)$ is the distribution of $X_t = X + \mathrm{Po}(t)$ where Po($t$) is an independent
Poisson($t$) random variable, and $\tilde{P}_t(r) = (r+1)\Pr(X_t = r+1)/(\lambda + t)$. More generally, the
same result holds for any random variable $X$ that has $K(X) < \infty$ and satisfies the logarithmic
Sobolev inequality of Proposition 2.

This result is reminiscent of the well-known de Bruijn identity, which states that the
(differential) relative entropy between a random variable $X$ and a Gaussian with the same
variance can be written as a weighted integral of (continuous) Fisher informations of convex
combinations of $X$ and an independent $N(0, t)$ random variable; see jllj[lj. In a similar vein,
if we formally expand the logarithm in the integrand in (11) as a Taylor series, then the
first term in the expansion (the quadratic term) turns out to be equal to $K(X_t)/2(\lambda + t)$.
Therefore,

$$D(P\|\mathrm{Po}(\lambda)) \approx \frac{1}{2}\int_0^\infty \frac{K(X_t)}{\lambda + t}\,dt,$$

giving an alternative formula to Proposition 2, also relating scaled Fisher information and
relative entropy.






3.2 Proofs 


Although in several places below we formally divide by a quantity which may be zero, this is
taken care of by the usual conventions, $0\log(0/a) = 0$, $0\log(0/0) = 0$, and $a\log(a/0) = \infty$,
for any $a > 0$.

Proof of Proposition 2. Let $\mathrm{Po}_\lambda(k)$ denote the Po($\lambda$) probabilities. In the case when $P$
has full support, the result follows immediately from Corollary 4 of [H], upon considering the
function $f(k) = P(k)/\mathrm{Po}_\lambda(k)$, $k \ge 0$.

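In the case of finite support, for $\epsilon > 0$ let $X^\epsilon$ have the mixture distribution

$$P^\epsilon = (1 - \epsilon)P + \epsilon\,\mathrm{Po}_\lambda.$$

Then $E(X^\epsilon) = \lambda$ and $P^\epsilon$ has full support, so by the previous part,

$$D(P^\epsilon\|\mathrm{Po}(\lambda)) \le K(X^\epsilon). \qquad (12)$$

But since $P(k) = 0$ for $k \ge N+1$, then $P^\epsilon(k)/\mathrm{Po}_\lambda(k) = \epsilon$ for those $k$, and letting $\epsilon \downarrow 0$ in the
left hand side of (12) we get

$$D(P^\epsilon\|\mathrm{Po}(\lambda)) = \sum_{k=0}^N P^\epsilon(k)\log\Big[\frac{P^\epsilon(k)}{\mathrm{Po}_\lambda(k)}\Big] + \Pr\{\mathrm{Po}(\lambda) > N\}\,\epsilon\log\epsilon \to D(P\|\mathrm{Po}(\lambda)).$$

Moreover,

$$\frac{(k+1)P^\epsilon(k+1)}{\lambda P^\epsilon(k)} = 1, \qquad k \ge N+1,$$

so

$$K(X^\epsilon) = \lambda\sum_{k=0}^N P^\epsilon(k)\Big[\frac{(k+1)P^\epsilon(k+1)}{\lambda P^\epsilon(k)} - 1\Big]^2 \to K(X),$$

as $\epsilon \downarrow 0$, and this completes the proof. □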


Next we prove the bound in (9) using a classical Poincaré inequality for the Poisson
distribution. We actually establish the following (apparently stronger) bound for the Hellinger
distance $\|P - \mathrm{Po}(\lambda)\|_H$ between $P$ and Po($\lambda$):

$$\|P - \mathrm{Po}(\lambda)\|_{TV}^2 \le \|P - \mathrm{Po}(\lambda)\|_H^2 \le 2K(X).$$

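Proof of (9). For any function $f : \mathbb{Z}_+ \to \mathbb{R}$, define $\Delta f(x) = f(x+1) - f(x)$. It is well-known
that, writing $\mathrm{Po}_\lambda(x)$ for the Poisson($\lambda$) probabilities, then for all functions $g$ in $L^2(\mathrm{Po}_\lambda)$,

$$\sum_x \mathrm{Po}_\lambda(x)(g(x) - \mu)^2 \le \lambda\sum_x \mathrm{Po}_\lambda(x)(\Delta g(x))^2 \qquad (13)$$

where $\mu = \sum_x g(x)\mathrm{Po}_\lambda(x)$ is the mean of $g$ under Po($\lambda$); see for example Klaassen jJH].
Using the simple fact that

$$(\sqrt{u} - 1)^2 \le (\sqrt{u} - 1)^2(\sqrt{u} + 1)^2 = (u - 1)^2, \qquad \text{for all } u \ge 0,$$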
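we get that

$$K(X) = \lambda\sum_x P(x)\Big(\frac{P(x+1)\mathrm{Po}_\lambda(x)}{\mathrm{Po}_\lambda(x+1)P(x)} - 1\Big)^2 \ge \lambda\sum_x P(x)\bigg(\sqrt{\frac{P(x+1)\mathrm{Po}_\lambda(x)}{\mathrm{Po}_\lambda(x+1)P(x)}} - 1\bigg)^2,$$

and applying (13) to the function $g(x) = \sqrt{P(x)/\mathrm{Po}_\lambda(x)}$ we obtain,

$$K(X) \ge \lambda\sum_x \mathrm{Po}_\lambda(x)\bigg(\sqrt{\frac{P(x+1)}{\mathrm{Po}_\lambda(x+1)}} - \sqrt{\frac{P(x)}{\mathrm{Po}_\lambda(x)}}\bigg)^2 \ge \sum_x \mathrm{Po}_\lambda(x)\bigg(\sqrt{\frac{P(x)}{\mathrm{Po}_\lambda(x)}} - \mu\bigg)^2 = 1 - \mu^2,$$

where $\mu = \sum_x \sqrt{P(x)\mathrm{Po}_\lambda(x)}$. Therefore, the Hellinger distance $\|P - \mathrm{Po}(\lambda)\|_H$ satisfies

$$\|P - \mathrm{Po}(\lambda)\|_H^2 = (2 - 2\mu) \le 2(1 - \mu^2) \le 2K(X),$$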

and since $\|P - \mathrm{Po}(\lambda)\|_{TV} \le \|P - \mathrm{Po}(\lambda)\|_H$ (see, e.g., [S21 P- 360]) the result follows. □
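The chain of inequalities used in this argument can be checked numerically for a concrete $P$. The sketch below takes $P$ to be Binomial(8, 0.2) (an arbitrary illustrative choice), sets $g = \sqrt{P/\mathrm{Po}_\lambda}$ and $\mu = \sum_x \sqrt{P(x)\mathrm{Po}_\lambda(x)}$, and verifies the Poincaré step, the identity $\sum_x \mathrm{Po}_\lambda(x)(g(x)-\mu)^2 = 1 - \mu^2$, and the bound $2 - 2\mu \le 2(1-\mu^2) \le 2K(X)$. The Poisson sums are truncated at a point where the remaining mass is negligible; all names are hypothetical.

```python
import math

def poisson_pmf(lam, kmax):
    out = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        out.append(out[-1] * lam / k)
    return out

def binomial_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

n, p, kmax = 8, 0.2, 60
lam = n * p
P = binomial_pmf(n, p) + [0.0] * (kmax - n)       # pad with zeros up to kmax
Po = poisson_pmf(lam, kmax)

# Scaled Fisher information; the term x = n has rho = -1 because P(n+1) = 0.
K = lam * sum(P[x] * ((x + 1) * P[x + 1] / (lam * P[x]) - 1) ** 2 for x in range(n + 1))

g = [math.sqrt(a / b) for a, b in zip(P, Po)]
mu = sum(gx * b for gx, b in zip(g, Po))           # equals sum_x sqrt(P(x) Po(x))

poincare_lhs = sum(b * (gx - mu) ** 2 for gx, b in zip(g, Po))
poincare_rhs = lam * sum(Po[x] * (g[x + 1] - g[x]) ** 2 for x in range(kmax))
hellinger_sq = 2 - 2 * mu

print(f"1 - mu^2 = {1 - mu**2:.6f}  <=  lam * sum Po (Delta g)^2 = {poincare_rhs:.6f}  <=  K = {K:.6f}")
print(f"Hellinger^2 = {hellinger_sq:.6f}  <=  2K = {2 * K:.6f}")
```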

For the proof of Proposition 3, as in the case of normal convergence in Fisher information,
we exploit the theory of $L^2$ spaces and the fact that scaled score functions of sums are
conditional expectations (projections) of the original scaled score functions.

Lemma. Convolution: If $X$ and $Y$ are independent nonnegative integer-valued random variables with
probability distributions $P$ and $Q$ and means $p$ and $q$, respectively, then

$$\rho_{X+Y}(z) = E\big[\alpha_X\rho_X(X) + \alpha_Y\rho_Y(Y) \,\big|\, X + Y = z\big],$$

where $\alpha_X = p/(p+q)$, $\alpha_Y = q/(p+q)$.

Proof. Writing $F(z+1) = \sum_x P(x)Q(z-x+1)$ for the distribution of $X + Y$, we have,

$$\begin{aligned}
\rho_{X+Y}(z) &= \sum_x \frac{(z+1)P(x)Q(z-x+1)}{(p+q)F(z)} - 1 \\
&= \sum_x \bigg[\frac{xP(x)Q(z-x+1)}{(p+q)F(z)} + \frac{(z-x+1)P(x)Q(z-x+1)}{(p+q)F(z)}\bigg] - 1 \\
&= \alpha_X\sum_x \bigg[\frac{xP(x)}{pP(x-1)} - 1\bigg]\frac{P(x-1)Q(z-x+1)}{F(z)}
 + \alpha_Y\sum_x \bigg[\frac{(z-x+1)Q(z-x+1)}{qQ(z-x)} - 1\bigg]\frac{P(x)Q(z-x)}{F(z)} \\
&\overset{(a)}{=} \sum_x \frac{P(x)Q(z-x)}{F(z)}\Big[\alpha_X\rho_X(x) + \alpha_Y\rho_Y(z-x)\Big],
\end{aligned}$$

as required, where (a) follows by moving $x$ to $(x+1)$ in the first sum. □
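The convolution identity can be confirmed on small examples. The following sketch assumes two independent variables with strictly positive pmfs on $\{0,1,2\}$ and $\{0,1,2,3\}$ (arbitrary illustrative values), computes the scaled score of the sum directly from the convolved pmf, and compares it with the conditional expectation of $\alpha_X\rho_X(X) + \alpha_Y\rho_Y(Y)$ given $X + Y = z$. Function and variable names are hypothetical.

```python
def scaled_score(pmf):
    """rho(x) = (x+1) P(x+1) / (lam P(x)) - 1 on the support {0, ..., len(pmf)-1}."""
    lam = sum(k * a for k, a in enumerate(pmf))
    ext = list(pmf) + [0.0]
    return [(x + 1) * ext[x + 1] / (lam * ext[x]) - 1.0 for x in range(len(pmf))]

def convolve(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for x, a in enumerate(p):
        for y, b in enumerate(q):
            out[x + y] += a * b
    return out

P = [0.5, 0.3, 0.2]            # pmf of X
Q = [0.4, 0.3, 0.2, 0.1]       # pmf of Y, independent of X
F = convolve(P, Q)             # pmf of X + Y

mean_p = sum(k * a for k, a in enumerate(P))
mean_q = sum(k * a for k, a in enumerate(Q))
alpha_x = mean_p / (mean_p + mean_q)
alpha_y = mean_q / (mean_p + mean_q)

rho_x, rho_y, rho_sum = scaled_score(P), scaled_score(Q), scaled_score(F)

max_gap = 0.0
for z in range(len(F)):
    # Conditional expectation of alpha_x*rho_x(X) + alpha_y*rho_y(Y) given X + Y = z.
    cond = sum(
        P[x] * Q[z - x] / F[z] * (alpha_x * rho_x[x] + alpha_y * rho_y[z - x])
        for x in range(len(P)) if 0 <= z - x < len(Q)
    )
    max_gap = max(max_gap, abs(cond - rho_sum[z]))

print(f"largest discrepancy over z: {max_gap:.2e}")
```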
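Proof of Proposition 3. It suffices to prove the case $n = 2$. By the Lemma,

$$\begin{aligned}
0 &\le E\Big[\frac{p_1}{p_1+p_2}\rho_{X_1}(X_1) + \frac{p_2}{p_1+p_2}\rho_{X_2}(X_2) - \rho_{X_1+X_2}(X_1+X_2)\Big]^2 \\
&= E\Big[\frac{p_1}{p_1+p_2}\rho_{X_1}(X_1) + \frac{p_2}{p_1+p_2}\rho_{X_2}(X_2)\Big]^2 - E\big[\rho_{X_1+X_2}(X_1+X_2)\big]^2,
\end{aligned}$$

therefore, noting that $E[\rho_X(X)] = 0$ for any random variable $X$,

$$\begin{aligned}
K(X_1+X_2) &= (p_1+p_2)\,E\big[\rho_{X_1+X_2}(X_1+X_2)\big]^2 \\
&\le (p_1+p_2)\,E\Big[\frac{p_1}{p_1+p_2}\rho_{X_1}(X_1) + \frac{p_2}{p_1+p_2}\rho_{X_2}(X_2)\Big]^2 \\
&= \frac{p_1}{p_1+p_2}\big(p_1E[\rho_{X_1}(X_1)]^2\big) + \frac{p_2}{p_1+p_2}\big(p_2E[\rho_{X_2}(X_2)]^2\big) \\
&= \frac{p_1}{p_1+p_2}K(X_1) + \frac{p_2}{p_1+p_2}K(X_2),
\end{aligned}$$

as claimed. □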

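Proof of Proposition 4. Assume for the moment that the relative entropy between $P_t$ and
Po($\lambda + t$) tends to zero as $t \to \infty$ (this will be established below). Then we can write
$D(P\|\mathrm{Po}(\lambda))$ as the integral

$$\begin{aligned}
D(P\|\mathrm{Po}(\lambda)) &= -\int_0^\infty \frac{\partial}{\partial t} D(P_t\|\mathrm{Po}(\lambda + t))\,dt \\
&= -\int_0^\infty \frac{\partial}{\partial t}\Big((\lambda + t) - E\big[X_t\log(\lambda + t) - \log(X_t!)\big] - H(X_t)\Big)\,dt \\
&= \int_0^\infty \Big(\log(\lambda + t) - \frac{\partial}{\partial t}E[\log(X_t!)] + \frac{\partial}{\partial t}H(X_t)\Big)\,dt.
\end{aligned}$$

Since the probabilities $P_t$ satisfy a differential-difference equation, $\frac{\partial P_t}{\partial t}(x) = P_t(x-1) - P_t(x)$,
we have,

$$\frac{\partial}{\partial t}E[\log(X_t!)] = \sum_r \frac{\partial P_t}{\partial t}(r)\log r! = \sum_r \big(P_t(r-1) - P_t(r)\big)\log r! = E\log(X_t + 1),$$

and similarly,

$$\frac{\partial}{\partial t}H(X_t) = -\sum_r \frac{\partial P_t}{\partial t}(r)\log P_t(r) = -\sum_r \big(P_t(r-1) - P_t(r)\big)\log P_t(r) = E\Big[\log\frac{P_t(X_t)}{P_t(X_t+1)}\Big].$$

Substituting these two expressions in the expansion of $D(P\|\mathrm{Po}(\lambda))$ the result follows.

Finally it remains to establish our initial assumption. If $X$ is the sum of independent
Bernoulli random variables then it has finite support and Proposition 2 holds; moreover,
$K(X)$ is easily seen to be finite by Proposition 3. More generally, using Propositions 2 and 3
we have

$$D(P_t\|\mathrm{Po}(\lambda + t)) \le K(X + \mathrm{Po}(t)) \le \frac{\lambda}{\lambda + t}K(X) \to 0,$$

as $t \to \infty$, as required. □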
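The key step of this proof is the pointwise identity $-\frac{\partial}{\partial t}D(P_t\|\mathrm{Po}(\lambda+t)) = D(P_t\|\tilde{P}_t)$, with $\tilde{P}_t(r) = (r+1)P_t(r+1)/(\lambda+t)$. The sketch below checks it with a central finite difference for $X \sim$ Bernoulli($p$) at one value of $t > 0$; the values of $p$, $t$, the step size and the truncation point are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math

def poisson_pmf(lam, kmax):
    out = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        out.append(out[-1] * lam / k)
    return out

def pmf_t(p, t, kmax):
    """pmf of X_t = X + Po(t) on {0..kmax}, with X ~ Bernoulli(p) independent of Po(t)."""
    po = poisson_pmf(t, kmax)
    return [(1 - p) * po[r] + (p * po[r - 1] if r >= 1 else 0.0) for r in range(kmax + 1)]

def d_to_poisson(p, t, kmax=40):
    """D(P_t || Po(lam + t)) with lam = p, truncated at kmax."""
    pt, q = pmf_t(p, t, kmax), poisson_pmf(p + t, kmax)
    return sum(a * math.log(a / b) for a, b in zip(pt, q) if a > 0)

def d_to_tilde(p, t, kmax=40):
    """D(P_t || P~_t), where P~_t(r) = (r+1) P_t(r+1) / (lam + t)."""
    pt = pmf_t(p, t, kmax + 1)
    lam_t = p + t
    return sum(
        pt[r] * math.log(pt[r] * lam_t / ((r + 1) * pt[r + 1]))
        for r in range(kmax + 1) if pt[r] > 0
    )

p, t, step = 0.3, 0.5, 1e-4
minus_derivative = -(d_to_poisson(p, t + step) - d_to_poisson(p, t - step)) / (2 * step)
integrand = d_to_tilde(p, t)

print(f"-d/dt D(P_t || Po(lam+t)) = {minus_derivative:.8f}")
print(f" D(P_t || P~_t)           = {integrand:.8f}")
```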